Lucía García-Duarte Sáenz


Analysing the NYC Taxi Trips

Every Thanksgiving, hundreds of thousands of New Yorkers leave the city to visit their relatives, while many others decide to stay. The New York City Council is interested in knowing how busy the streets are and where people travel inside the city during this important holiday. So, the aim of this work is to analyse the New York City network of taxi trips to discover further insights regarding how people move within the city, which hours are the busiest, and which factors have the greatest influence on tips. The dataset to be used can be found here.

To do so, different tasks will be carried out: (1) data exploration and cleaning to preprocess the data and prepare it for the analysis, (2) data summary to find interesting features, insights and patterns within the data, and (3) model building to better understand the typical tip clients leave cab drivers depending on the trip, and which factors influence this amount the most.

1. Data Exploration and Cleaning

Before summarizing the main features of the data and gaining a relevant understanding of taxi trip tendencies during Thanksgiving, it is important to prepare, clean and complete the available data.

First, let us load the data and see some samples:
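A minimal loading sketch with pandas, using a tiny in-memory CSV as a stand-in for the real file (the column names shown are assumptions based on the variables discussed later):

```python
import io

import pandas as pd

# Stand-in CSV: a tiny sample with a subset of the 17 columns, used here
# only to illustrate the loading step (real file name/columns may differ).
sample = io.StringIO(
    "pickup_datetime,dropoff_datetime,trip_distance,total_amount\n"
    "2018-11-22 12:05:00,2018-11-22 12:20:00,3.1,14.5\n"
    "2018-11-22 18:40:00,2018-11-22 19:02:00,7.8,26.0\n"
)
trips = pd.read_csv(sample, parse_dates=["pickup_datetime", "dropoff_datetime"])

print(trips.shape)   # (rows, columns) of the loaded frame
print(trips.head())  # first few records
```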

The dataset comprises 174,347 records and 17 variables.

Now, let us check that all data belongs to the specified period of time, which is Thanksgiving Day 2018:
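A quick sanity check of this kind could look as follows (toy data standing in for the loaded frame):

```python
import pandas as pd

# Hypothetical frame standing in for the loaded trips data.
trips = pd.DataFrame({"pickup_datetime": pd.to_datetime(
    ["2018-11-22 08:00:00", "2018-11-22 23:59:00"])})

# Every pickup should fall on Thanksgiving Day 2018 (November 22).
unique_dates = trips["pickup_datetime"].dt.date.unique()
print(unique_dates)
```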

Indeed! That is the only date we were looking for, so we don't need to remove any entries here. Now, before we start looking at the data for cleaning and exploration, we must check whether there are any missing values. If there are, we will have to perform imputation to complete the dataset.
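Such a check might look like this (illustrative frame; the real notebook runs it over the full dataset):

```python
import pandas as pd

# Illustrative frame; column names follow the variables in this analysis.
trips = pd.DataFrame({"trip_distance": [3.1, 7.8],
                      "total_amount": [14.5, 26.0]})

# Count unobserved values per column; all zeros means no imputation needed.
missing_per_column = trips.isnull().sum()
print(missing_per_column)
print("any missing:", missing_per_column.any())
```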

As seen, there are no unobserved values! Therefore, imputation is not needed and we can proceed.

Explore the variables

Here, we are going to visualize different variables and perform some calculations to detect abnormalities, incorrect values or values that do not make any sense. We will also make use of this dictionary to verify that the recorded data matches its description. Additionally, we will make use of this look-up table to learn about the different locations where taxis pick up and drop off passengers.

Let's begin by plotting some variables:

From the graphs and the data dictionary, we can draw the following conclusions:

If we look at the aforementioned look-up table, locations 264 and 265 are not defined (N/A). Therefore, we will remove records whose location ID is 264 or greater. Let's look at how many records are wrongly stored for the pickup_location_id and dropoff_location_id variables:
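A sketch of this count and filter, on toy data with the column names used in this analysis:

```python
import pandas as pd

# Toy data; real IDs live in pickup_location_id / dropoff_location_id.
trips = pd.DataFrame({
    "pickup_location_id":  [100, 264, 42, 265],
    "dropoff_location_id": [132, 10, 264, 7],
})

MAX_VALID_ID = 263  # IDs 264 and 265 are N/A in the look-up table

# How many records carry an undefined location on each end of the trip?
bad_pickups = (trips["pickup_location_id"] > MAX_VALID_ID).sum()
bad_dropoffs = (trips["dropoff_location_id"] > MAX_VALID_ID).sum()
print(bad_pickups, bad_dropoffs)

# Keep only trips whose both endpoints are defined zones.
clean = trips[(trips["pickup_location_id"] <= MAX_VALID_ID) &
              (trips["dropoff_location_id"] <= MAX_VALID_ID)]
```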

As seen, there are quite a few records for which we do not know where the cabs came from or went to.

Additionally, there are variables that seem numerical but are indeed categorical according to their description in the dictionary. Let us check the values that these variables have:

According to the dictionary:

So, we must eliminate the records that do not take these values, including negative entries.

Lastly, there are several variables that cannot take negative values, namely those related to money, speed, time and distance. Note that we have constructed two additional variables, diff_datetime (trip duration) and velocity (average speed in km/h), for further insight. Again, let's look at how many records are stored as negative (notice that in some cases values cannot be zero either, e.g. speed, time or fare):
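A sketch of how these two variables can be derived and the impossible values flagged (toy data; trip_distance is assumed to be in km):

```python
import pandas as pd

# Toy trips; the real columns carry the same names used in the analysis.
trips = pd.DataFrame({
    "pickup_datetime":  pd.to_datetime(["2018-11-22 10:00", "2018-11-22 11:00"]),
    "dropoff_datetime": pd.to_datetime(["2018-11-22 10:30", "2018-11-22 11:00"]),
    "trip_distance": [10.0, 0.0],  # assumed to be in km here
})

# Trip duration in minutes and average speed in km/h.
trips["diff_datetime"] = (
    trips["dropoff_datetime"] - trips["pickup_datetime"]
).dt.total_seconds() / 60
trips["velocity"] = trips["trip_distance"] / (trips["diff_datetime"] / 60)

# Zero or negative durations/speeds are physically impossible and flagged.
invalid = trips[(trips["diff_datetime"] <= 0) | (trips["velocity"] <= 0)]
print(len(invalid))
```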

Before continuing, we are going to consider the features velocity, diff_datetime and trip_distance, and plot their histograms and boxplots to look for inconsistencies.

It can be seen that the distributions are skewed to the right, so mostly short trips occurred. Also, we can observe that there are very large values for such variables, which do not make much sense. Therefore, we are going to set some upper thresholds (dashed black lines in the graphs) to eliminate outliers:

In the next table we can see how many records are considered to have extreme values for each selected variable:

Notice that there are other aspects that can be checked, for instance:

Data cleaning

Now that we have analysed the data and highlighted the incorrect entries, we are ready to filter and clean the dataset. Here we present two different ways of doing so:

  1. Basic Python: the data is filtered by applying different conditions over the dataframe in the Python environment.
  2. PySpark: the data is filtered in a distributed environment. For big data volumes, this parallel approach greatly reduces computation time.
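A minimal sketch of approach (1) on toy data, with the equivalent PySpark call noted in a comment (column names are assumptions):

```python
import pandas as pd

trips = pd.DataFrame({
    "fare_amount": [14.5, -3.0, 26.0],
    "trip_distance": [3.1, 2.0, 7.8],
})

# (1) Plain pandas: boolean masks combined with & over the dataframe.
clean = trips[(trips["fare_amount"] > 0) & (trips["trip_distance"] > 0)]
print(len(clean))

# (2) The same conditions in PySpark (not executed here):
#     from pyspark.sql.functions import col
#     clean = trips_sdf.filter((col("fare_amount") > 0) &
#                              (col("trip_distance") > 0))
```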

Up to now, we have been able to explore the data and apply data-quality and cleaning methodologies. However, potential outliers still remain in the data. For instance, let's consider the variable diff_datetime. From the graphs it can be seen that outliers exist in the data. Indeed, for durations of less than 2 minutes, the price is in some cases disproportionately large. Therefore, we will remove trips with durations shorter than 2 minutes.

See that more robust approaches could be applied to further improve the quality of the data. For instance, the Minimum Covariance Determinant (MCD) estimator finds the observations whose empirical covariance has the smallest determinant, yielding a subset of data 'free' from outliers. This could be interesting for future studies.
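For illustration, a small sketch of MCD-based outlier flagging with scikit-learn's MinCovDet, on synthetic 2-D data:

```python
import numpy as np
from sklearn.covariance import MinCovDet

rng = np.random.default_rng(0)
# 200 well-behaved 2-D points plus one gross outlier appended at the end.
X = rng.normal(0, 1, size=(200, 2))
X = np.vstack([X, [12.0, 12.0]])

# MCD fits a covariance on the 'cleanest' subset; large Mahalanobis
# distances under that fit flag likely outliers.
mcd = MinCovDet(random_state=0).fit(X)
distances = mcd.mahalanobis(X)
print(int(np.argmax(distances)))  # index of the most outlying point
```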

Now, the number of records has been reduced by 9.6%, and we still have a substantial amount of data to start digging into interesting trends and discoveries.

2. Data Summary

In this section, we will use the previously cleaned data to spot some interesting features and try to understand the urban mobility in NYC caused by the celebration of Thanksgiving.

For this purpose, we will first, similarly to what we did before, take a look at the different variables and try to discover differences between short and long trips. We will also perform some operations by grouping the variables so as to find out curious trends. Additionally, a spatial analysis will be carried out to better visualize the outcomes in a geolocated way.

Exploratory analysis

The data was divided into short and long trips by splitting it according to the distance travelled. We considered 10 km as the limit to categorize the taxi trips into these two labels.
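The split can be sketched as follows (toy distances, assumed in km):

```python
import pandas as pd

trips = pd.DataFrame({"trip_distance": [2.5, 9.9, 10.1, 18.0]})  # km (assumed)

# Label each trip using the 10 km threshold: (0, 10] short, (10, inf) long.
trips["trip_type"] = pd.cut(trips["trip_distance"],
                            bins=[0, 10, float("inf")],
                            labels=["short", "long"])
print(trips["trip_type"].value_counts())
```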

First, we will consider the numerical variables:

From the plots below, the following can be seen:

Let us now consider the categorical variables:

Main insights can be summarised as follows:

Now that we have uncovered a good deal of information about taxi trips during Thanksgiving, let us dig a little deeper and discover the peak hours at which most people took a taxi.
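The per-hour counts behind this kind of plot can be obtained as follows (toy timestamps):

```python
import pandas as pd

trips = pd.DataFrame({"pickup_datetime": pd.to_datetime([
    "2018-11-22 13:00", "2018-11-22 13:30", "2018-11-22 13:45",
    "2018-11-22 04:10", "2018-11-22 19:05", "2018-11-22 19:40",
])})

# Number of trips per hour of the day; the peak hours stand out directly.
trips_per_hour = trips["pickup_datetime"].dt.hour.value_counts().sort_index()
print(trips_per_hour)
print("busiest hour:", trips_per_hour.idxmax())
```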

We can see how the number of trips drops greatly at night, as most people are sleeping and the party and stay-up-late lovers are probably heading back home. We can also see two peaks, at lunch and dinner time, as people gather to spend Thanksgiving together. Additionally, in the plot below we can see how the number of passengers stays proportional throughout the day, with 1 being the leading tendency:

Conversely, we can also see how the average speed of the taxis increases at night and decreases during the day, probably because there is more traffic and the streets are busier; at night, people are more prone to speed up.

Interestingly, the following plot shows that at night trips were much longer! They were also more expensive, partly because of the extra overnight charge that is applied, and partly because the distances travelled were longer. You have probably already noticed the drastic peak at 5 am; let's focus on it to try to understand what happened at that specific time.

If we extract the top-5 pick-up and drop-off locations at 5 am, we can come to a conclusion. The location with the most drop-offs was an airport in Queens (LaGuardia)! This airport primarily accommodates airline service to domestic (and limited international) destinations, which explains many people travelling very early in the morning to spend Thanksgiving Day with their loved ones living in a different state. It also explains the increase in price, as airport taxi trips are usually expensive. Additionally, the third most common location is also an airport, while the fifth is Penn Station, a railway station serving travelers not only from New York City but also from Long Island and New Jersey, which explains people traveling to these two destinations from NYC. Lastly, the most demanded pick-up location was JFK Airport, corresponding to people coming back home to visit friends and family for the holiday.

Spatial analysis

We have seen how the hour of the day affected the cost, duration and speed of the trips. Now let's take a look at the spatial features to understand and visualize the most common trips and keep gaining knowledge about how busy the streets are and where people travel inside the city during this important holiday.

First, we are going to visualize the network of NYC taxi trips, defined as follows:

To be able to extract the links and weights, an R script was developed (see notebook Annex I, section Chunk 1). The extracted files containing node and link information were imported into the Gephi software to visualize the network using the 'Geo Layout' distribution, which uses coordinates to place each node. The number of trips from one node to another is represented on a scale from dark (more trips) to light (fewer trips) green. A second representation was made, showing in black the links with weights larger than 100, meaning that more than 100 trips were made between the connected stations (taking direction into account).
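For reference, the same edge extraction can be sketched in Python (the original was done in R): this builds the Source/Target/Weight edge list that Gephi imports, shown here on toy data.

```python
import pandas as pd

trips = pd.DataFrame({
    "pickup_location_id":  [100, 100, 100, 42],
    "dropoff_location_id": [132, 132, 7, 100],
})

# Directed edge list: one row per (source, target) pair, weighted by the
# number of trips between them.
edges = (trips.groupby(["pickup_location_id", "dropoff_location_id"])
              .size()
              .reset_index(name="Weight")
              .rename(columns={"pickup_location_id": "Source",
                               "dropoff_location_id": "Target"}))
print(edges)
# edges.to_csv("edges.csv", index=False)  # then load into Gephi
```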

map

From the two plots it can be seen that most trips occurred within the borough of Manhattan. The fact that it is considered to be the downtown of NYC explains the huge number of trips during the day. Additionally, lots of trips from many different regions start or end at JFK Airport and LaGuardia Airport, which is related to the previous findings regarding airports and mobility to travel in and out the city.

map

Let us now load the data to continue with the exploration:

First, let us find the most common trips:

In line with the previous findings, all the locations in this top 5 belong to the borough of Manhattan. And what if we take a look at the most common short and long trips?

Again, the most common short trip occurred between two locations in Manhattan, very close to each other. And the most common long trip occurred between Times Square, also in Manhattan, and JFK Airport. All of this is aligned with what we saw earlier.

Additionally, different dynamic plots were built in R for deeper insights. Screenshots are included here, but you can refer to notebook Annex I, section Chunk 2 to play with the maps. First, a map of NYC was drawn indicating the total number of pick-ups and drop-offs in each zone. This plot resembles a heatmap of the most transited zones. In line with previous statements, the majority of pick-ups and drop-offs occurred in popular areas such as Manhattan or the airports, and the further away we move from the city center, the fewer trips occur.

map1

map2

Additionally, as stated, most passengers travel alone. Still, there are some areas where people go in groups:

map3

map4

3. Model Building

In this last section we will try to infer tips based on the available data, and we will discuss the results. We will first train different models and select the one yielding the most accurate results, to later analyze which features were most important for predicting the tips.

The pipeline for model building is the following:

  1. Feature engineering
  2. Preprocessing: encoding and standardization
  3. Feature selection
  4. Model training

Recall that at the very beginning we saw that there were no missing values, so imputation won't be included in this pipeline.

Feature engineering

In previous steps, we created new variables for the analysis, velocity and diff_datetime, and we will use them as well for predictions. In addition to that, we will convert pick-up and drop-off datetimes into new separate variables: hour (already constructed before), min and sec.

Let's check the column names:

Before continuing, we are going to split the data into training, validation and test sets. We will keep the validation data to perform hyperparameter tuning (inner evaluation) and the test data to evaluate the performance of the model (outer evaluation).
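A sketch of such a three-way split with scikit-learn (the 60/20/20 proportions here are an assumption):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))  # stand-in features
y = rng.normal(size=100)       # stand-in tips

# 60/20/20 split: hold out the test set first, then carve the validation
# set out of the remainder (0.25 of 80% = 20% of the total).
X_tmp, X_test, y_tmp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_tmp, y_tmp, test_size=0.25, random_state=42)
print(len(X_train), len(X_val), len(X_test))
```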

Preprocessing

Now, we will standardize the numerical variables so that all variables give equal contribution to the model. Also, to avoid creating numerical dependencies among categorical variables, we will one-hot-encode them. The following chunk of code will be included in each model pipeline as the preprocessing step.
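A sketch of that preprocessing step (toy columns; the real lists of numeric and categorical variables come from the dataset):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

X = pd.DataFrame({
    "trip_distance": [3.1, 7.8, 1.2],
    "payment_type":  ["card", "cash", "card"],
})

numeric = ["trip_distance"]
categorical = ["payment_type"]

# Scale numeric columns; one-hot encode categoricals so no artificial
# ordering is introduced. This step plugs into each model pipeline.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), numeric),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
])
Xt = preprocess.fit_transform(X)
print(Xt.shape)  # 1 scaled column + 2 one-hot columns
```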

Feature selection and model training

Now, we will select features according to the k highest scores using the Select K Best method, being k a hyperparameter, and we will train different regression models: a Dummy model, and two simple models, namely (1) a K-Nearest Neighbour (KNN) model and (2) a Decision Tree (DT) model. The trivial/dummy model will be used as a reference, and the other models will be trained with and without hyperparameter tuning.
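A sketch of this setup on synthetic data: each model is wrapped in a pipeline with Select K Best, and k is tuned by grid search (the grids and scoring shown are illustrative, not the exact ones used in the notebook):

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data standing in for the preprocessed trips;
# only the first feature actually drives the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
y = 3 * X[:, 0] + rng.normal(scale=0.1, size=200)

models = {
    "dummy": DummyRegressor(),
    "knn": KNeighborsRegressor(),
    "tree": DecisionTreeRegressor(random_state=0),
}
results = {}
for name, model in models.items():
    pipe = Pipeline([("select", SelectKBest(f_regression)),
                     ("model", model)])
    # k (the number of kept features) is tuned as a hyperparameter.
    search = GridSearchCV(pipe, {"select__k": [1, 5, 10]},
                          scoring="neg_root_mean_squared_error", cv=3)
    search.fit(X, y)
    results[name] = -search.best_score_  # inner-validation RMSE
print(results)
```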

Trivial model

KNN - default hyperparameters

Decision tree - default hyperparameters

Model evaluation and selection

In the following table we can see the results of the trained models. As expected, the dummy model performs far worse than the KNN or the DT, though it is much faster due to its simple computation. Additionally, the KNN not only takes much longer to train but also gives worse results than the decision tree models. Even though both DT models perform very similarly in terms of outer validation, we will select the default model as the optimal one because it is faster. Notice that the outer RMSE being lower than the inner one indicates that the model is not overfitted.

Discussion

Now that we have selected the optimum model, let us take a look at the predictions to see how much they resemble the ground truth.

On the left-hand graph, the 'ideal' model is represented by the black line. The further away we move from this line, the worse the predictions. Notice that most of the dots lie on or close to the line, and no tips are estimated to be negative. Large tips are harder to predict, as points spread more the further they are from 0. This occurs because we have few samples with huge tips and many entries with lower amounts, so the model learns to predict smaller values better from the input data.

On the right-hand side, we can see that the distributions are very similar, though for very small tips the model tends to output 0, which can be seen in the huge peak at $0 tips followed by a large drop just after it. This can also be seen in the bottom-left part of the previous graph.

Another interesting exercise is to look for the variables that play an important role in predicting the tips. As observed in the following graph, out of more than 500 features, only a few actually provide useful information. Our model selected 10 features (remember that the selected model was the one with default hyperparameters), which appears to be good enough to obtain appropriate estimations.

Let's see which these 10 features are:

It can be seen that most of these variables are numerical variables related to money: the total cost of the trip is directly related to the charges that add up to it, including tips. Tips are also influenced by the distance, speed, and duration of the trips, which are in turn directly related to each other. Whether the applied rate is standard or not, and whether it is a JFK rate or not, is of great importance as well. In addition, whether payment was made in cash or by credit card is important for estimating tips, while the 'no charge' category is not meaningful for this purpose. In contrast, location variables such as pick-up and drop-off zones do not seem to help in predicting tips.

To sum up

We have extracted some useful knowledge about NYC taxi mobility during Thanksgiving. The main aspects can be summarized as follows: